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Abstract 

We derive a stochastic gradient algorithm for semidefinite optimization using randomization 
techniques. The algorithm uses subsampling to reduce the computational cost of each iteration 
and the subsampling ratio explicitly controls granularity, i.e. the tradeoff between cost per 
iteration and total number of iterations. Furthermore, the total computational cost is directly 
proportional to the complexity (i.e. rank) of the solution. We study numerical performance on 
some large-scale problems arising in statistical learning. 

1 Introduction 

Beyond classic combinatorial relaxations [GW95J, semidefinite programming has recently found a 
new stream of applications in machine learning [LCB + 02] . geometry [WS06J, statistics [dB EG06j 
or graph theory [SBXD06J. All these problems have a common characteristic: they have relatively 
low precision targets but form very large semidefinite programs for which obtaining second order 
models is numerically hopeless which means that Newton based interior point solvers typically fail 
before completing even a single iteration. Early efforts focused on exploiting structural properties 
of the problem (sparsity, block patterns, etc), but this has proven particularly hard for semidefinite 
programs. For very large problem instances, first-order methods remain at this point the only 
credible alternative. This follows a more general trend in optimization which seeks to significantly 
reduce the granularity of solvers, i.e. reduce the per iteration complexity of optimization algorithms 
rather than their total computational cost, thus allowing at least some progress to be made on 
problems that are beyond the reach of current algorithms. 

In this work, we focus on the following spectral norm minimization problem 



minimize 
subject to y £ Q 



(D 



in the variable y £ R p , with parameters Aj 6 S n , for j = 1, . . . ,p, b £ R p and C £ S n , where Q 
is a compact convex set. Throughout the paper, we also implicitly assume that the set Q C R p is 
simple enough so that the complexity of projecting y on Q is relatively low compared to the other 
steps in the algorithm. 

The idea behind this paper stems from a recent result by [JLNS09] , who used a mirror descent 
stochastic approximation algorithm for solving bilinear matrix games (see [Ncs09j, [PJ92] or [NY83 j 
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for more background), where subsampling is used to perform matrix vector products and produce 
an approximate gradient. Strikingly, the algorithm has a total complexity of 0(n log re/e 2 ), when 
the problem matrix is re x n, hence only requires access to a negligible proportion of the matrix 
coefficients as the dimension re tends to infinity 

In parallel, recent advances in large deviations and random matrix theory have produced a 
stream of new randomization results for high dimensional linear algebra (see [FKV04} IDKM 06, 
AM07, KV09] among many others), motivated by the need to perform these operations on very 
large scale, sometimes streaming, data sets in applications such as machine learning, signal process- 
ing, etc. Similar subsampling techniques have been successfully applied to support vector machine 
classification [KBH08] or Fourier decomposition. Randomization results were used in |AK07| to 
produce complexity bounds for certain semidefinite programs arising in combinatorial relaxations 
of graph problems. Randomization was also used in [BLO02J and [BLO05J to approximate subdif- 
ferentials of functions that are only differentiable almost everywhere. 

Our contribution here is to further reduce the granularity of first-order semidefinite program- 
ming solvers by combining subsampling procedures with stochastic approximation algorithms to 
derive stochastic gradient methods for spectral norm minimization with very low complexity per 
iteration. In practice, significantly larger per iteration complexity and memory requirements mean 
that interior point techniques often fail to complete a single iteration on very large problem in- 
stances. CPU clock also runs much faster than RAM, so operations small enough to be performed 
entirely in cache (which runs at full speed) are much faster than those requiring larger data sets. 
Solver performance on very large problem instances is then often more constrained by memory 
bandwidth than clock speed, hence everything else being equal, algorithms running many cheap 
iterations will be much faster than those requiring fewer, more complex ones. Here, subsampling 
techniques allow us to produce semidefinite optimization algorithms with very low cost per itera- 
tion, where all remaining 0(n 2 ) operations have a small constant and can be performed in a single 
pass over the data. 

We also observe that the relative approximation error in computing the spectral norm (or 
trace norm) of a matrix using subsampling is directly proportional to the numerical rank of that 
matrix, hence another important consequence of using subsampling techniques to solve large-scale 
semidefinite programs is that the total complexity of running the algorithm becomes explicitly 
dependent on the complexity (i.e. rank) of its solution. 

The paper is organized as follows. Section [2] surveys some key results on randomized linear 
algebra and spectral norm approximations. In Section [3] we then derive a stochastic approxima- 
tion algorithm for spectral norm minimization with very low cost per iteration and discuss some 
extensions to statistical learning problems. Finally, we present some numerical experiments in 
Section 01 

Notation 

We write S n the set of symmetric matrices of dimension re. For a matrix X E R mxn , we write ||^||f 
its Frobenius norm, ||AT|| tr its trace norm, ||X||2 its spectral norm, at(X) its i-th largest singular 
value and let H-X^loo = max^ \Xij\, while X^ is the i-th column of the matrix X and Xu\ its i-th 
row. We write vec(X) the vector of R mn obtained by stacking up the columns of the matrix X 
and NumRank(J) the numerical rank of the matrix X, where NumRank(X) = ||X|||i/||X||2. 
Finally, when x E R n is a vector, we write ||x[|2 its Euclidean norm, while || • || is a general norm 
on R m and || • IL its dual norm. 
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2 Randomized linear algebra 



In this section, we survey several results by [DKM06] which, after a single pass on the data, sample 
columns to approximate matrix products and produce low rank matrix approximations with a 
complexity of 0{sn) where s is the sampling rate. 

2.1 Randomized matrix multiplication 
Algorithm 1 Matrix multiplication 

Input: A G R mxn , B G R nxp and s such that 1 < s < n. 
1: Define a probability vector q G R n such that 



\\A®b\\B, 



«ll2 



ZU\\AU)\\ 2 \\B U) \\ 2 > 



1, . . . ,n. 



Define subsampled matrices C G R mxs an d R G R sxp as follows, 
for i = 1 to s do 

Pick j G [1, n] with P(j = I) = q[. 
Set CW = AW/^sqj and i? w = B {j) /^/sq]. 
end for 

Output: Matrix product Ci? approximating 

By construction, we have E[CR] = AB, and the following randomization result from [DKM07] 
controls the precision of the approximations in algorithm [TJ 

Lemma 1 Let A G R mxn , B G R nxp ; given a subsampling rate s such that 1 < s < n, suppose 
that C G R mxs and R G R sxp are computed according to algorithm^ above, then 

n\\AB-CR\\l\<\\\Af F \\B\\ 2 F 



and if (3 G [0, 1] with rj = 1 + y/S\og(l/($) then 

\\AB-CRf F < — 
s 

with probability at least 1 — j3. 

Proof. See Theorem 1 in |DKM07j . ■ 

Note that using the adaptive probabilities qi is crucial here. The error bounds increase by a 
factor n when % = 1/n for example. 

2.2 Randomized low-rank approximation 

Algorithm [2] below computes the leading singular vectors of a smaller matrix S, which is a 
subsampled and rescaled version of X. Here, the computational savings come from the fact that 
we only need to compute singular values of a matrix of dimension m x s with s < n. Recall that 
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Algorithm 2 Low-rank approximation 



Input: X G R mxn and k, s such that 1 < k < s < n. 
1: Define a probability vector q G R n such that q. t = \\X® IH/II^IIf) f° r i = 1, ■ ■ • ,n. 
2: Define a subsampled matrix S G R mxs as follows. 
3: for i = 1 to s do 

4: Pick an index j G [1, n] with P(j = I) = qi. 
5: Set 5® = XW/^/Sqj. 
6: end for 

7: Form the eigenvalue decomposition S T S = Y diag(a)Y T where Y G R sxs and cr G R s . 
8: Form a matrix # G R mxfc with ffW = SY®/<r\ /2 . 
Output: Approximate singular vectors if®, i = 1, . . . , fc. 



computing k leading eigenvectors of a symmetric matrix of dimension s only requires matrix vector 
products, hence can be performed in 0(ks 2 log s) operations using iterative algorithms such as the 
power method or Lanczos method (see the appendix for details, as usual we omit the precision 
target in linear algebra operations, implicitly assuming that it is much finer than e), so that the 
cost of computing k leading singular vectors of a matrix of size raxsis 0{ksm log m). 

This means that, given the probabilities the total cost of obtaining k approximate singular 
vectors using algorithm [2] is 0{ksm\ogm) instead of 0{knm\ogm) for exact singular vectors. Of 
course, computing qi requires mn operations, but can be done very efficiently in a single pass over 
the data. We now recall the following result from [DKM06J which controls the precision of the 
approximations in algorithm [2J 

Lemma 2 Let X G R mx ™ anc [ l < k < s < n. Given a precision target e > 0, if s > 4/e 2 and 
H G R mxfc is computed as in algorithm^ we have 

E[||X - H k HlXf 2 ] < \\X - X k f 2 + e||X||| 

and if in addition s > Arj 1 /e 2 where rj = 1 + \/81og(l//3) for (3 G [0, 1], then 

\\X — H k H k X\\l < \\X — X k \\l + e||X|||p 

with probability at least 1 — (3, where X k is the best rank k approximation of X. 

Proof. See Theorem 4 in |DKM06j . ■ 

An identical precision bound holds in the Frobenius norm when s > 4/c/e 2 . We now adapt these 
results to our setting in the following lemma, which shows how to approximate the spectral radius 
of a symmetric matrix X using algorithm O 

Lemma 3 Let X G R mxn and (3 G [0,1]. Given a precision target e > 0, construct a matrix 
S G R mxs by subsampling the columns of X as in algorithm^ Let rj = 1 + Y / 81og(l//3) and 

\\X\\ 2 

s = t] 2 NumRank(I) 2 (2) 

e z 

we have 

B[\\\S\\ 2 -\\X\\ 2 \]<e 
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and 

\\\S\\ 2 - \\X\\ 2 \ < e 

with probability at least 1 — j3. 

Proof. Using the Hoffman- Wielandt inequality (see [5S901 Th. 3.1] or the proof of [DKM061 Th.2] 
for example) we get 

\\\S\\l - Pllll < \\SS T -XX T \\ F 



hence 

lll^lb - pr|| 2 | < \\ss T - xx T \\ F /\\x\\ 2 

and Jensen's inequality together with the matrix multiplication result in Lemma [U yields 

E[\\SS T -XX T \\ F ] < 

and 



\ss T -xx T \\ F < 



with probability at least 1 — f3. Combining these two inequalities with the sampling rate in §2$ 

2 \\X\\ F 

yields the desired result. ■ 

The subsampling rate required to achieve a precision target e has a natural interpretation. 
Indeed 

llxii 2 

s = r] 2l -^P- NumRank(I) 2 

is simply the squared ratio of the numerical rank of the matrix X over the relative precision tar- 
get e/||X||2, times a factor rj 1 controlling the confidence level. The numerical rank NumRank(X) 
always satisfies 1 < NumRank(X) = ||X|||,/||X||2 < Rank(X) and can be seen as a stable relax- 
ation of the rank of the matrix X (see [RV07] for a discussion). Note also that, by construction, 
the subsampled matrix always has lower rank than the matrix X. The expectation bound is still 
valid if we drop the factor r\. 



3 Stochastic approximation algorithm 

Below, we will use a stochastic approximation algorithm to solve problem (pQ) when the gradi- 
ent is approximated using the subsampling algorithms detailed above. We focus on a stochastic 
approximation of problem ([T|) written 

- b T y (3) 

in the variable y G R p and parameters Aj £ S n , for j = 1, . . . ,p, b € R p and C £ S n , with 1 < s < n 
controlling the sampling rate, where the function ||7r^(S?=i VjAj + C) 1 1 2 and a subgradient with 



mm f(y) 



E 



7T 



ELi yjAj + c 
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respect to y are computed using algorithms Q] and [2J For X G S n , we have written tt^ s \X) the 
subsampling/scaling operation used in algorithms [U and [2] with 

n^(X) = S, (4) 

where < s < n controls the sampling rate and S G R nxs is the random matrix defined in 
algorithm [2] whose columns are a scaled sample of the columns of X. We will write S = irr s \(X) the 

2 

matrix obtained by subsampling rows as in algorithm [TJ We also define A G R n xp as the matrix 
whose columns are given by A^' = vec(Aj), j = 1, . . . ,p. 



3.1 Stochastic approximation algorithm 

We show the following lemma approximating the gradient of the function ||7r^(^^ =1 yjAj + C) 1 1 2 
with respect to y and bounding its quadratic variation. 

2 

Lemma 4 Given Aj G S n with A G R n xp defined as above, for j = 1, . . . ,p, b G R p ; C G S n and 
sampling rates s\ and S2, a (stochastic) subgradient of the function \\^ sl '{Y^=i J/j-A? + C) 1 1 2 — b T y 
with respect to y is given by the vector w G R p with 

w = A T vec(vv T ) — b 

where v G R n is a leading singular vector of the subsampled matrix S = tt^ Si \X) formed in 
algorithm^ Furthermore, the product A T vec(vv T ) can be approximated using algorithmic to form 
an approximate gradient 

g = ^XA T )-n {s2) {vec{vv T ))-b, 



which satisfies 



Elg}=A T vec(vv T )-b£df(y) and E[\\gf 2 ) < 2^Ii + 2 ||6|||. (5) 



Proof. Iterated expectations give E[g] = E[w] G df(y). The sampling probabilities % used in 
approximating the matrix vector product A T vec(vv T ) following algorithm Q] are defined as 

\\-A(i)h\ vec(vv T )i\ . 

1 = 1, . . . , n. 



£- =1 IIA;)IWvec( 



OV T )j\ 



As in [DKM07, Lemma 3], the quadratic variation of the approximate product 7r^ S2 \A T ) ir( S2 -- ) (vec(vv T 
is then given by 

nw^W - (S2 )(vec(^))ni] = e£i Mw t:: HT) ' - 

With pi defined as above, we get 



^ P w |||vec(^ T ) t 2 < (EtljAdbvec^ 



4 = 1 



|2 



\A 



S-2 
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by Cauchy-Schwarz, because || •vec(vv T )\\2 = \\vv T Wp = \\vW2 = 1, hence the desired result. ■ 

We now use this result to produce an explicit bound on the complexity of solving problems ([3]) 
and ([XJ) by subsampling using a stochastic approximation algorithm. In this section, we let || • || be 
a general norm on R p , we write || • ||* its dual norm and define as the smallest number such 
that ||y|| 2 < <5*(p)|M|* for all y G R p . Following the notation in [JLNS091 §2.3], we let ui(x) be a 
distance generating function, i.e. a function such that 

Q° = I x G Q : 3y G R m , x G argmm[y T u + u(u)] > 
[ ueQ J 

is a convex set. We assume that u>(x) is strongly convex on Q° with modulus a with respect to the 
norm || • ||, which means 

(y - x) T (Vto{y) - Voj(x)) > a\\y - x\\ 2 , x, y G Q°. 

We then define a prox-function V(x,y) on Q° x Q as follows: 

V(x, y) = Lo{y) - [u){x) + Vu(x) T (y - x)], 

which is nonnegative and strongly convex with modulus a with respect to the norm || • ||. The 
prox-mapping associated to V is then defined as 

P x' U (v) = argmin{y T (z - x) + V(x, z)}. (6) 
zeQ 

Finally, we define the u diameter of the set Q as: 

D u q = (maxw(2;) - minu;(2;)) 1 / 2 (7) 
zeQ zeQ 

and we let 7; for I = 0, . . . , N be a step size strategy. 
Algorithm 3 Spectral norm minimization using subsampling 

Input: Matrices Aj G S n , for j = 1, . . . , p, b G R p and C G S n , sampling rates si and ^2- 
1: Pick initial yo G Q 
2: for / = 1 to N do 

3: Compute v G R n , the leading singular vector of the matrix tt^ 1 ^ (Y%=i yi,jAj+C) , subsampled 

according to algorithm [2] with k = 1 and s = s\. 
4: Compute the approximate subgradient g\ = 7r( S2 )(^4 T ) TT^(-vec(vv T )) — b, by subsampling 

the matrix product using algorithm Q] and s = S2- 
5: Set y i+1 = P^lnigi)- 

6: Update the running average y N = Ylk=o HVll Y^k=o 7z- 
7: end for 

Output: An approximate solution y^ G R p of problem (J3]) with high probability. 



The following results control the convergence of the robust stochastic approximation algorithm [3] 
(see |JLJNS09j . |JNes09j . [PJ92] or |JNY83j for further details). We call y the optimal solution of 
problem ([3]), the lemma below characterizes convergence speed in expectation. 
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Lemma 5 Given N > 0, let M* be defined as in ^) by 

2 



Mi 



•32 



+ 21 



12) 



using a fixed step size strategy with 



11 



D, 



u>,Q 



we have, after N iterations of algorithm [3| 



1 = 1, 



,N 



W(m) - f{y)\ < D^ Q 5i(p)MtJ^ 



and 



Km) - Kv) > t 



with probability less than t v — ^ . 

Proof. By construction E[||g||^] < 6 2 (p)M%, the rest follows from |JLNS09[ §2.3] for example. 
Lemma [5] means that we need at most 

2Dl Q 5Kp)M? 



N 



ae 2 (3 2 



iterations to get an e solution to problem ([3]) with confidence at least 1 — (3. Typically, the prox 
function lo and the norm are chosen according to the geometry of Q, to minimize N. The choice of 
norm also affects 6*(p) and obtaining better bounds on M* in ([5]) for generic norms would further 
tighten this complexity estimate. 

We now call y* the solution to the original (deterministic) spectral norm minimization prob- 
lem ([I]) and bound the suboptimality of yjv in the (true) problem ([T|) with high probability. 



Theorem 1 If the sampling rate s\ is set to 

2 



Si 



NumRank y*Aj + C 



then after 



iterations of algorithm 0, we have 



N 



2Dl Q 5l{p)Ml 



ae 2 (5 2 



(8) 



(9) 



V 








VNjAj + C 


- b T y N - 


X>H + C 


+ b T y* < 2e 


3=1 


2 


3=1 


2 



with probability at least 1 — f3. 
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Proof. Recall that we have written y* the solution to the original (deterministic) problem JT]), 
y the solution to the approximate (stochastic) problem ([3]) and y^ the iV-th iterate of algorithm [3] 
above. Lemma [5] on the convergence of y^ to the solution of the stochastic problem in ([3]) means 

f(m) - f(y) < e 

with probability at least 1 — (3. By definition, y minimizes the stochastic problem, so in particular 
f(y) < f(y*) an d we have in fact 

f(m) - f(y*) < e. (10) 

with probability at least 1 — (5. Now, with s% defined as above, Lemma [3] on the quality of the 
subsampling approximation to ||.||2 shows that if the sampling rate is set as in (|8|) then 



E 



EU y*Aj + c - vrW [Y,U y*i A i + c 



< e 



and Jensen's inequality yields 



YlUy^ + C) b T y*-f(y*) 



< e. 



which bounds the difference between the minimum of the (true) problem in ([I]) and the value f(y*) 
of its stochastic approximation in ([3|), combining this with inequality (|10p we finally get that 

f(m) - I (E U y) A 3 + c) || 2 + b T y* < 26. 

with probability at least 1 — /?, which is the desired result. ■ 

This result allows us to bound the oracle complexity of solving ([T|) by subsampling. In practice 
of course, both the spectral norm and the numerical rank of the solution matrix Ef=i Vj^j + @ are 
unknown. However, assuming we have a stopping criterion, i.e. a function which efficiently certifies 
that a given y £ R p is optimal, we can search for the minimum sampling rate in ([8]) by starting 
from a low target and doubling the sampling rate until we obtain an optimal solution. The simple 
lemma below explicitly summarizes the complexity of this procedure. 

Lemma 6 Suppose we start from a sampling rate s = 1 and run algorithm [3] repeatedly, doubling 
the sampling rate until the stopping criterion certifies the solution is optimal. Then, with probability 
at least 1 — (5, algorithm needs to be run at most 

r io S2( s i)l 

times, where s\ is given in |2|), before it finds an optimal solution to problem (Qp. 

Proof. Starting from s = 1, we simply need to double the sampling rate at most [log 2 (si)] before 
it becomes larger than s\. At the sampling rate s = s\, algorithm [3] will produce an optimal solution 
with probability 1 — (3. ■ 

In fact, we will see below that the complexity of each iteration is dominated by a term 
0{sn log n), where s is the sampling rate, and because 

ri°g2(>i)i 

V 2 i < 2r io S2(si)l+i < 4 Sl 

i=l 
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we then observe that searching for the minimal sampling rate by repeatedly solving ([3]) for increasing 
sampling rates will be less than four times as expensive as solving the problem in the oracle case. 

Typically, producing a stopping oracle means computing a duality gap and we will show in §3.31 
how this can be done efficiently here. Note that in the absence of such a stopping criterion, the 
minimum sampling rate in (|8|) has to be enforced over all matrices X = Yjj=iVj^j + ^> wmc h 
considerably increases overall complexity. The next section provides a detailed analysis of the 
complexity of algorithm [3] as a function of e, s\ and S2- 

3.2 Complexity 

We now study in detail the complexity of algorithm O Suppose we are given a precision target e 
and fix the sampling rate S2 arbitrarily between 1 and n 2 , with the sampling rate s\ set as in 
Theorem [TJ The cost of each iteration in algorithm [3] breaks down as follows. 

• On line 3: Computing the leading singular vector v, using algorithm [2] with k = 1. This 
means first computing the probabilities qi at a cost of 0{n 2 ) operations. Forming the matrix 
S = 7i^ sl '(X^f=i VljAj + C) costs 0{ns\) operations. It remains to compute the leading 
singular vector of S using the Lanczos method at a cost of O(sinlogn) (cf. the appendix for 
details). The total numerical cost of this step is then bounded by c\n 2 + cins\ where c\ and 
C2 are absolute constants. Here, c\ is always less than ten while C2 is the number of iterations 
required by the Lanczos method to reach a fixed precision target (typically le-8 or better 
here) hence we have c\ <C ci- 

• On line 4: Computing the approximate subgradient g\ = tt^ S2 \A t ) 7T( S2 )(vec(TO T )) — b, by 
subsampling the matrix product using algorithm [TJ This means again forming the vector q at 
a cost of 0(n 2 ) (the row norms of A can be precomputed) . Computing the subsampled matrix 
vector product then costs 0{ps2)- Both of these complexity bounds have low constants. 

• On line 5: Computing the projection yi + \ = Pyj'" (ll9l) > whose numerical cost will be denoted 
by c(p). 

Let us remark in particular that all 0(n 2 ) operations above only require one pass over the data, 
which means that the entire data set does not need to fit in memory. Using the bound on the 
quadratic variation of the gradient computed in Lemma [U we can then bound the number of 
iterations required by algorithm [3] to produce a e-solution to problem (pQ) with probability at least 
1 — 13. Let us call Y* = Y^=x Uj^j + ^> anc ^ recan th & t T) = 1 + y / 81og(l//3), Table UJ summarizes 
these complexity bounds and compares them with complexity bounds for a stochastic approximation 
algorithm without subsampling. 



Complexity 


Stoch. Approx. 


Stoch. Approx. with Subsampling 


Per Iter. 
Num. Iter. 


C4n 2 p + c(p) 

2^,Q5*(p) 2 (||-4|l! + ||b||!) 2 


c\n 2 + C3PS2 + C2nlogn rj 2 ^ Y NumRank(y*) 2 + c(p) 

^ Q s*( P ?(^+\\b\\if 







Table 1: Complexity of solving problem (jTJ) using subsampled stochastic approximation 
method versus original algorithm. Here c±, . . . , C4 are absolute constants with ci, C3 -C 02,04. 
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We observe that subsampling affects the complexity of solving problem ([I]) in two ways. De- 
creasing the (matrix product) subsampling rate S2 £ [l,^ 2 ] decreases the cost of each iterations 
but increases the number of iterations in the same proportion, hence has no explicit effect on the 
total complexity bound. In practice of course, because of higher cache memory speed and better 
bandwidth on smaller problems, cheaper iterations tend to run more efficiently than more complex 
ones. The impact of the (singular vector) subsampling rate s± £ [l,n] is much more important 
however, since computing the leading eigenvector of the current iterate is the most complex step in 
the algorithm when solving problem ([T]) using stochastic approximation. Because ci,C3 <C C2, the 
per iteration complexity of solving large-scale problems essentially follows 

2 jl^L NumRank(y*) 2 

hence explicitly depends on both the numerical rank of the solution matrix Y* = Y2j=i V*j^j + & 
and on the relative precision target e/||Y"*||2. This means that problems with simpler solutions will 
be solved more efficiently than problems whose solutions has a high rank. 

The choice of norm || • || and distance generating function also has a direct impact on complexity 
through c(p) and 5*(p)M*. Unfortunately here, subsampling error bounds are only available in the 
Frobenius and spectral norms hence part of the benefit of choosing optimal norm/distance gener- 
ating function combinations is sometimes lost in the norm ratio bound 5*(p). However, choosing a 
norm/prox function combination according to the geometry of Q can still improve the complexity 
bound compared to a purely Euclidean setting. 

Finally, subsampling can have a more subtle effect on complexity. By construction, solutions to 
problem (TjQ) tend to have multiple leading singular values which coalesce near the optimum. Intro- 
ducing noise by subsampling can potentially break this degeneracy and increase the gap between 
leading eigenvalues. Since the complexity of the algorithm depends in great part on the complexity 
of computing a leading singular vector using iterative methods such as the power method or the 
Lanczos method (cf. Appendix), and the complexity of these methods decreases as the gap between 
the two leading singular values increases, subsampling can also improve the efficiency of iterative 
singular value computations. 

3.3 Surrogate Duality Gap 

In practice, we often have no a priori knowledge of NumRank(Y*) 2 and if the sampling rate s 
is set too low, it's possible for the algorithm to terminate at a suboptimal point Y where the 
subsampling error is less than e (if the error at the true optimal point Y* is much larger than e). In 
order to search for the optimal sampling rate s as in Lemma El we first need to check for optimality 
in (TjQ) and we now show how to track convergence in algorithm [3] by computing a surrogate duality 
gap, at a cost roughly equivalent to that of computing a subgradient. The dual of problem (TTJ is 
written 

maximize Tr(CX) — Sq(w) 

subject to Wj = bj — Tr(AjX), j = 1, . . . ,p (11) 

ll^lltr < 1, 

in the variables X 6 S n and uu £ R p , where Sq(v) is the support function of the set Q, defined as 

Sq(w) = max w T y. 
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For instance, when Q is an Euclidean ball of radius B, problem (jlip becomes 

maximize Tr(CX) — B||u>||2 

subject to Wj = bj — Tr(AjX), j = l,...,p (12) 

ll^lltr < 1, 

in the variables X £ S n and w G R p . The leading singular vector v in algorithm [3] always satisfies 
||^f T ||tr < 1) hence we can track convergence in solving (pQ) by computing the following surrogate 
duality gap 

-b T y-v T Cv + S Q (w) (13) 

2 

where u;, = bj — v T AjV for i = 1, . . . ,p. 




3.4 Minimizing the sum of the k largest singular values 

Motivated by applications in statistical learning, we now discuss direct extensions of the results 
above to the problem of minimizing the sum of the k largest singular values of an affine combination 
of matrices, written 

k 

™§ E ** (U=i + c )~ hT y ( 14 ) 

in the variable y £ R p , with parameters Aj £ S n , for j = 1, . . . ,p, b & R p and C € S n . As in the 
previous section, we also form its stochastic approximation 

mm/( 2/ )^E[Ei 1 ^( 7 rW(Ef =1 %A + C'))] - b T y (15) 

in the variable y £ R p , with 1 < s < n controlling the sampling rate. We now prove an analog of 
Lemma [3] for this new objective function. 

Lemma 7 Let X £ R mxn and (3 £ [0,1]. Given a precision target e > 0, k > 1 and a matrix 
S £ R mxs constructed by subsampling the columns of X as in algorithm^ let n = 1 + Y / 81og(l//3) 
and 

. _ f ? (*»' ^rr^ ( Xf 

e 2 k 
where k(X) = o~i(X)/a r (X) with r = min {k, Rank(X)} ; we have 



E 

and 

with probability at least 1 — j3. 



£*LlM*)-^(S)|J <e 
k 

jMx)- ffi (s)|<6 
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Proof. Because Rank(5'S' r ) < Rank(XX T ) by construction, we always have 

k k 

£|o?(X)-o?(S)| = "EWiW-a^KatW + a^S)) 
i=i i=i 

> ^(x^i^po-^s)! 

i=l 

where r = min {k, Rank(X)}. Because the sum of the fc largest singular values is a unitarily 
invariant norm on S n (see [H.T91I §3.4]), Mirsky's theorem (see jSSM Th. 4.11] for example) shows 
that 

k k 

Y,\^(x)-af(s)\ = y,H xxT )- ^ ssT )\ 

i=i i=i 

k 

< ^<ji{XX T - SS T ) 
1=1 

and because, by construction, the range of SS T is included in the range of XX T , we must have 
Rank(XX T - SS T ) < Rank(XX T ) and 

k 

^<Ji{XX T - SS T ) < ^Rank(X) \\XX T - SS T \\ F 
i=l 

Jensen's inequality together with the matrix multiplication result in Lemma Q] yield 

B[\\SS T -XX T \\ F ] < Hft 

and 

\\ss T -xx T \\ F < 1 ^fi- 

Vs 

with probability at least 1 — f3. Combining these inequalities with the sampling rate in (|16p 

X|||Rank(X) 



e 2 a r {Xf 
and using 

\\X\\% NumRank(I) 2 4 

(E?=1<*P0) 2 M*) 2 - P 4 

yields the desired result. ■ 

Once again, the subsampling rate in the above lemma has a clear interpretation, 

e z k 

is the product of a term representing relative precision, a term reflecting the rank of X and a term 
in k(X) representing its (pseudo) condition number. Note that the bound can be further refined 
when a r < e. Lemma [7] allows us to compute the gradient by subsampling when using algorithm [3] 
to solve problem (|14ft . The remaining steps in the algorithm are identical, except that the matrix 
vv T is replaced by a combination of matrices formed using the k leading singular vectors. 
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4 Applications Sz numerical results 



In this section, we first detail a few instances of problem ([I]) arising in statistical learning. We then 
study the numerical performance of the methods detailed here on large scale problems. 

4.1 Spectral norm minimization 

For a given matrix AG S n , we begin by studying a simple instance of problem (JTJ) written 

minimize \\A + J7||2 ,^_n 
subject to \U{j\ < p, i,j = l,...,n 

in the matrix U £ S n . This problem is closely related to a relaxation for sparse PC A (see 
[dEGJL07]) and we use it in the next section to test the numerical performance of algorithm [3J 
The complexity of the main step in the algorithm (i.e. computing the gradient) is controlled by 
the sampling rate in Lemma El which is written 

II A _i_ TJ*\\ 2 
s = r? " „ 112 NumRankM + U*f 

where U* £ S n is the optimal solution to problem ()1T[) . 

4.2 Matrix factorization and collaborative filtering 

Matrix factorization methods have been heavily used to solve collaborative filtering problems (e.g. 
the Netflix problem) and we refer the reader to [Sre04j . |Bac07| . |RFP07j or |CK08j for details. All 
these references form a particular instance of problem (|14p , written 



minimize 



subject to y £ Q, 



J tr 



(18) 



in the variable y E R p where Q is a low dimension norm ball for example and the matrices Aj 
have a block format with only a few nonzero coefficients. Here, the trace norm can be understood 
as a convex lower bound on the rank function (as in [FHBOlj ) but sometimes also has a direct 
interpretation in terms of learning (see [Src04j ) . In this particular case, the complexity of the main 
step in the algorithm (i.e. computing the gradient) is controlled by the sampling rate in Lemma El 
which can be simplified here to 

a = n 2 ^^ K (Y*) 2 Rank(Y*) 

where Y* = Ylj=iVj A j + C and k(Y*) = a x {Y*) / a r {Y*) with r = Rank(Y*). The bound can 
be further refined when <j r < e. In practice, the complexity of solving problem (|18p can often be 
further reduced using the simple observation that an optimal solution of (I14p will also be optimal 
in (|18p whenever Rank(Y fc *) < k, where Yj* is the optimal solution to (|14p here. Once again, the 
sampling rate s has a natural interpretation as the product of a relative precision term, a term 
reflecting the condition number of the solution and the rank of the optimal solution. It means 
in particular that problems whose solutions have a lower rank are explicitly easier to solve than 
problems with more complex solutions. 



14 



4.3 LASSO 



Consider a particular instance of problem (|14p written 

minimize ||y||i < ig . 
subject to \\Ax — 6||a < cr 

in the variable y G R n , with A G R mxn , 6 G R m and a > 0. This is a (somewhat trivial) 
version of problem (|14p . where the matrices are diagonal and Q is an ellipsoid. This problem is 
directly related to LASSO, i.e. ^-penalized regression (see |Tib96| ) . In the diagonal case, the low 
rank matrix approximation produced by algorithm [2] simply picks s coefficients of y with probability 
proportional to their magnitude. Here, computing the gradient is trivial, but computational savings 
come from the fact that the bound on the quadratic variation of the gradient in ([5]) is now equal 
to the sampling rate, so M* = s in the complexity estimate ([9]). Since s is chosen as above, with 

s = 77 2 ^U(y*) 2 Card(i/*) 

where n(y*) = ym/yw) with r = Card(y*), this means that the complexity of solving problem (|19p 
is (explicitly) proportional to the cardinality of the solution Card(y*). Of course, this algorithm is 
not competitive with specialized algorithms for solving (|19p . but this subsampling bound provides 
some theoretical support for the empirical observations made in [DT06] using homotopy methods. 

4.4 Fastest mixing Markov chain on a graph 

As in [BDX04| . suppose we are given a connected graph with vertex set V = {1, . . . ,n} and edge 
set £ C V x V, with G £ 44> G £, where all vertices have a selfdoop. We define a Markov 
chain on this graph with transition probability matrix P G R nxn , where 

Pij = Prob(X m = j\X t =i), i,j = 1, ... ,n 

This matrix satisfies P = P T and PI = 1, which means that the equilibrium distribution of this 
Markov chain is uniform and the largest singular value of P is equal to one. The asymptotic rate 
of convergence of this Markov chain to its equilibrium distribution is controlled by the second 
singular value of P, with smaller values of 02 (-P) producing faster convergence. [ BDX04 ] exploited 
this property to show that the fastest mixing Markov chain on the graph (V,£) could be computed 
by minimizing 02 (-P) over all possible transition matrices on the graph, i.e. by solving 

minimize 02 (-P) 

subject to P > 0, PI = 1, P = P T , (20) 
Pij = 0, {i,j)i£ 

in the variable P G R nxn . The optimal mixing rate is often significantly faster than the rate 
provided by classical chains such as maximum degree or Metropolis-Hastings. Because ci(P) = 1 
here, this is a particular instance of problem (j!4[) where k = 2 and the matrices Aj are sparse. 
Projections on Q can be handled as in [BDX04| . Once again, the complexity of the main step in 
the algorithm (i.e. computing the gradient) is controlled by the sampling rate in Lemma [TJ which 
simplifies to 

s = n 2 - — I — ^NumRank(P*) 2 Rank(P*) 

e 2 o- 2 (P*) 

where P* is the transition matrix of the fastest mixing Markov chain. Once again, simpler transition 
matrices mean faster convergence. 
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4.5 Numerical experiments 



In this section, we test the quality of the subsampling approximations detailed in Section [2] on 
various matrices. We also evaluate the performance of the algorithms detailed above on large scale 
problem instances. Numerical code reproducing these experiments is available from the author's 
webpage. 



Randomized low-rank approximations. Here, we first measure the quality of the randomized 
low-rank matrix approximation on both randomly generated matrices and on covariance matrices 
formed using gene expression data. Because the spectrum of naive large scale random matrices 
is very structured, these examples are too simple to appropriately benchmark numerical error in 
algorithm [2j Fortunately, as we will see below, generating random symmetric matrices with a given 
spectral measure is straightforward. 

Suppose X E S n is a matrix with normally distributed coefficients, ~ A/"(0, 1), i, j = 1, . . . , n. 
If we write its QR decomposition, X = QR with Q, R E R nxn , then the orthogonal matrix Q 
is Haar distributed on the orthogonal group O n (see }Dia03j for example). This means that to 
generate a random matrix with given spectrum fi E R n , we generate a normally distributed matrix 
X, compute its QR decomposition and the matrix Q diag(fi)Q T will be uniformly distributed on the 
set of symmetric matrices with spectrum fj,. Because the spectral measure of "natural" covariance 
matrices often follows a power law (Tracy- Widom in the Gaussian case, see [JohOlj and |EK07j for 
a discussion), we sample the spectrum jit from a beta distribution with various exponents to get 
realistic random matrices with a broad range of numerical ranks. We also use a covariance matrix 
formed using the gene expression data set in jABN+99] . 

In Figure [U we plot relative error e/||X||2 versus numerical rank NumRank(X) in loglog scale 
with 20% subsampling and n = 500 on random matrices generated as above and on the gene ex- 
pression covariance from [A BN + 99] . We notice that, on these experiments, the relative error grows 
at most linearly with the numerical rank of the matrix, as predicted by Lemma [3l We then plot the 
histogram in semilog scale of relative error e/||X||2 over theoretical bound j)NumRank(X)/^/s 
for random matrices with n = 500. In Figure [2j we plot relative error e/||X||2 versus sampling rate 
s, in loglog scale, for a gene expression covariance with n = 500. Once gain, the error decreases 
as l/y^ as predicted by Lemma El We also plot the median speedup factor (over ten runs) in 
computing largest magnitude eigenvalues using ARPACK with and without subsampling on a gene 
expression covariance matrix with n = 2000, for various values of the sampling ratio s/n. Note 
that both exact and subsampled eigenvalues are computed using direct MEX calls to ARPACK 
by [LSY98j . as eigs (MATLAB's interface to ARPACK) carries a massive overhead. In all the 
experiments above, the confidence level used in computing r/ was set to 



Stochastic approximation with subsampling. In Figure [3l we generate a sample ratings 
matrix X = VV for the collaborative filtering problem in §4.21 where V is a discrete feature 
matrix V E {0, 1, 2} 100x3 . We "observe" only 30% of the ratings and solve problem (fT4"j) with k = 4 
to approximately reconstruct the full ratings matrix. We plot objective value versus CPU time in 
seconds for this sample matrix factorization problem, using a stochastic approximation algorithm 
with deterministic gradient or the subsampled gradient algorithm [3] with subsampling ratio s\/n 
set at 20%. We also plot surrogate duality gap versus CPU time on the same example. We notice 
that while the subsampled algorithm converges much faster than the deterministic one, the quality 
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io° io 1 icr 4 10- 3 nr 2 10-' 

NumRank(l) Error / Theoretical error 



Figure 1: Left: Loglog plot of relative error e/||X||2 versus numerical rank NumRank(J) 
with 20% subsampling and n = 500 on random matrices (blue dots) and gene expression 
covariance (red square). The dashed line has slope one in loglog scale. Right: Histogram 
plot in semilog scale of relative error e/||X||2 over theoretical bound 7yNumRank(X)/y / s 
for random matrices with n = 500. 




Sampling rate s Sampling ratio s/n 

Figure 2: Left: Loglog plot of relative error e/||X||2 versus sampling rate s for a gene 
expression covariance with n = 500. The dashed line has slope -1/2 in loglog scale. Right: 
Plot of median speedup factor in computing largest magnitude eigenvalue, using ARPACK 
with and without subsampling on a gene expression covariance matrix with n — 2000, for 
various values of the sampling ratio s/n. 



17 




50 100 150 200 250 50 100 150 200 250 



CPU time (sees.) CPU time (sees.) 

Figure 3: Left: Objective value versus CPU for a sample matrix factorization problem 
in dimension 100, using a deterministic gradient (squares) or a subsampled gradient with 
subsampling rate set at 20% (circles). Right: Surrogate duality gap versus CPU time on 
the same example. 



of the surrogate dual points and duality gap produced using subsampled gradients as in §3.31 is 
worst than in the deterministic case. 

In Table O using the same 20% sampling rate we compare CPU time versus problem dimension 
n for subsampled and deterministic algorithms when solving the following instance of problem ([T|) 

minimize ||C + X||2 
subject to H^Hoo < P 

in the variable X € S n where C is a covariance matrix constructed using a subset of size n of the 
var iables in [ABN+99| . for various values of n. Finally, we generate and solve sample collaborative 
filtering problems as in (|14h for ratings matrix of various dimensions n. We report median CPU 
time over ten sample problems in Table [3l Here, subsampling speeds up the algorithm by an order 
of magnitude, however the stochastic approximation algorithm is still not competitive with (non 
convex) local minimization techniques over low rank matrices. 



n 


Deterministic 


Subsampling 


Speedup factor 


500 


5 


5 


0.92 


750 


19 


13 


1.40 


1000 


32 


24 


1.31 


1500 


107 


58 


1.84 


2000 


281 


120 


2.34 



Table 2: CPU time (in seconds) versus problem dimension n for deterministic and sub- 
sampled stochastic approximation algorithms on spectral norm minimization problems. 
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n 



Deterministic 



Subsampling 



Speedup factor 



100 
200 
500 



154 
766 
4290 



23 
63 
338 



6.67 
12.2 
12.7 



Table 3: Median CPU time (in seconds) versus problem dimension n for deterministic and 
subsampled stochastic approximation algorithms on collaborative filtering problems. 



5 Appendix 

The complexity results detailed above heavily rely on the fact that extracting one leading eigen- 
vector of a symmetric matrix X £ S n can be done by computing a few matrix vector products. 
While this simple fact is easy to prove using the power method when the eigenvalues of X are well 
separated, the problem becomes significantly more delicate when the spectrum of X is clustered. 
The section that follows briefly summarizes how modern numerical methods solve this problem in 
practice. 

5.1 Computing one leading eigenvector of a symmetric matrix 

We start by recalling how packages such as LAPACK [ABB ^99] form a full eigenvalue (or Schur) 
decomposition of a symmetric matrix X £ S n . The algorithm is strikingly stable and, despite its 
0(n 3 ) complexity, often competitive with more advanced techniques when the matrix X is small. 
We then discuss the problem of approximating one leading eigenpair of X using Krylov subspace 
methods with complexity growing as 0(n 2 log n) with the dimension (or less when the matrix is 
structured). 

Full eigenvalue decomposition. Full eigenvalue decompositions are computed by first reducing 
the matrix X to symmetric tridiagonal form using Householder transformations, then diagonalizing 
the tridiagonal factor using iterative techniques such as the QR or divide and conquer methods for 
example (see [SteOH Chap. 3] for an overview). The classical QR algorithm (see |GVL90t §8.3]) 
implicitly relied on power iterations to compute the eigenvalues and eigenvectors of a symmetric 
tridiagonal matrix with complexity 0(n 3 ), however more recent methods such as the MRRR al- 
gorithm by |DP03] solve this problem with complexity 0(n 2 ). Starting with the third version of 
LAPACK, the MRRR method is part of the default routine for diagonalizing a symmetric matrix 
and is implemented in the STEGR driver (sec [DPV06J). 

Overall, the complexity of forming a full Schur decomposition of a symmetric matrix X £ S n 
is then 4n 3 /3 flops for the Householder tridiagonalization, followed by 0(n 2 ) flops for the Schur 
decomposition of the tridiagonal matrix using the MRRR algorithm. 

Computing one leading eigenpair. We now give a brief overview of the complexity of com- 
puting leading eigenpairs using Krylov subspace methods and we refer the reader to [SteOl, §4.3], 
|GVL90t §8.3, §9.1.1] or |Saa92] for a more complete discussion. Let u £ R™ be a given vector, we 
form the following Krylov sequence 
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by computing k matrix vector products. If we call tCk(X, u) the subspace generated by these vectors 
and write X = Y^i=\ \ x i x J a spectral decomposition of X, assuming, for now, that 

Ai > A2 > . . . > A n , 

one can show using Chebyshev polynomials (see e.g. |Ste01 j, §4.3.2] for details) that 

+ /i v (v \\ < tanZ(xi,n) Xy - A 2 
tanZ (xi,/C fc (A,u)) < - [ j— where 77 = - — - 

1 • 2yfif+^ ' ' 



in other words, Krylov subspaces contain excellent approximations of leading eigenpairs of X. 

This result is exploited by the Lanczos procedure to extract approximate eigenpairs of X called 
Ritz pairs (see [GVL901 Chap. 9] or [§5.1.2] [SteOl] for a complete discussion). In practice, the 
matrix formed by the Krylov sequence is very ill-conditioned (as X k u gets increasingly close to the 
leading eigenvector) so the Lanczos algorithm simultaneously updates an orthogonormal basis for 
JCk(X, u) and a partial tridiagonalization of X. The Lanczos procedure is described in Algorithm 2] 
and requires k matrix vector products and an additional 4nk flops. Note that the only way in which 
the data in X is accessed is through the matrix vector products Xuj . 



Algorithm 4 Lanczos decomposition. 

Input: Matrices X £ S n and initial vector u\ £ R™. 

1: Set no = and (3q = 0. 

2: for j = 1 to k do 

3: Compute v = Xuj. 

4: Set ctj = ujv. 

5: Update v = v — ctjUj — f3j-\Uj-i. 

6: Set j3j = \\v\\2- 
7: Set = v/Pj. 
8: end for 
Output: A Lanczos decomposition 

XUk = UkTk + Pk u k+iGk, 
where £/& G j^ nxfc j s orthogonal and Tfc £ Sfc is symmetric tridiagonal. 



In theory, one could then diagonalize the matrix (which costs 0(k 2 ) as we have seen above) 
to produce Ritz vectors. In practice, key numerical difficulties often arise. First, finite precision 
arithmetics cause a significant loss of orthogonality in £/&. This is remedied by various reorthogonal- 
ization strategies (cf. [SteOli §5.3.1]). A more serious problem is clustered or multiple eigenvalues 
in the spectrum periphery. In fact, it is easy to see that Krylov subspace methods cannot isolate 
multiple eigenvalues. Assume that the leading eigenvalue has multiplicity two for example, we then 
have 

A k u = {{x[u)xi + {xlu)x 2 )\ k 1 + (xj u)x 3 \% + ... + (xlu)x n \ k n 

and everything happens as if the eigenvalue Ai was simple and the matrix X had a larger nullspace. 
This is not a problem in our case, since we need only one eigenvector in the leading invariant 
subspace, not the entire eigenspace. 
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Clustered eigenvalues (i.e. a small gap between the leading eigenvalue and the next one, not 
counting multiplicities) are much more problematic. The convergence of Ritz vectors cannot be 
established by the classical Chebyshev bounds described above, and various references provide a 
more refined analysis of this scenario (see |PSS82j . [VdSVdV87| . [KW92] among others). Success- 
ful termination of a deterministic Lanczos method can never be guaranteed anyway, since in the 
extreme case where the starting vector is orthogonal to the leading eigenspace, the Krylov subspace 
contains no information about leading eigenpairs. In practice, Lanczos solvers use random initial 
points. In particular, K \Y!)2. Th.4.2] show that, for any matrix I 6 S„ (including matrices with 
clustered spectrum), starting the algorithm at a random u\ picked uniformly over the sphere means 
the Lanczos decomposition will produce a leading Ritz pair with relative precision e in 

iterations, with probability at least 1 — 5. This is of course a highly conservative bound and in 
particular, the worst case matrices used to prove it vary with k. 

This means that computing one leading eigenpair of the matrix X requires computing at most 
k Lan matrix vector products (we can always restart the code in case of failure) plus Ank Lan flops. 
When the matrix is dense, each matrix vector product costs n 2 flops, hence the total cost of 
computing one leading eigenpair of X is 

^ n 2 log(n/^ 

flops. When the matrix is sparse, the cost of each matrix vector product is O(s) instead of 0(n 2 ), 
where s is the number of nonzero coefficients in X. Idem when the matrix X has rank r < n and an 
explicit factorization is known (which is the case in the algorithms detailed in the previous section) , 
in which case each matrix vector product costs 0{nr) which is the cost of two n by r matrix vector 
products, and the complexity of the Lanczos procedure decreases accordingly. 

The numerical package ARPACK by [LSY98J implements the Lanczos procedure with a reverse 
communication interface allowing the user to efficiently compute the matrix vector product Xuj. 
However, it uses the implicitly shifted QR method instead of the more efficient MRRR algorithm 
to compute the Ritz pairs of the matrix 6 



5.2 Other sampling techniques 

For completeness, we recall below another subsampling procedure in |AM07] . More recent "volume 
sampling" techniques produce improved error bounds (some with multiplicative error bounds) but 
the corresponding optimal sampling probabilities are much harder to compute, we refer the reader 
to |KV09| for more details. The key idea behind this result is that, as the matrix dimension n 
grows and given a fixed, scale invariant precision target ||X||f/€, the norm HXHoo of individual 
coefficients in X typically becomes negligible and we can randomly discard the majority of them 
while keeping important spectral features of X mostly intact. 

Lemma 8 Given X £ S n and e > 0, we define a subsampled matrix S whose coefficients are 
independently distributed as: 

„ _ f Xij/p with probability p, , , 

13 1 otherwise. 
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when i > j, and Sij = Sji otherwise. Assume that 1 > p > (8 log n) 4 /n ; then 

\\x - S\\ 2 <M\X\\ 00 y^/p'. 
with probability at least 1 — exp(— 19(log n) 4 ). 
Proof. See |AM07l Th. 1.4]. ■ 

At first sight here, bounding the approximation error means letting the probability p grow 
relatively fast as n tends to infinity. However, because HXHoo/e is typically much smaller than 
||X||i?/e, this subsampling ratio p can often be controlled. Adaptive subsampling, i.e. letting p 
vary with the magnitude of the coefficients in X, can further improve these results (see [AM07} §4] 
for details). The average number of nonzero coefficients in the subsampled matrix can be bounded 
using the structure of X. Note that the constants in this result are all very large (in particular, 
1 > P > (8 log n) 4 /n implies n > 10 9 ) so despite its good empirical performance in low dimensions, 
the result presented above has to be understood in an asymptotic sense. 
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